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Abstract 

Tensor factorizations are computationally hard problems, and in particular, are often significantly 
harder than their matrix counterparts. In case of Boolean tensor factorizations - where the input tensor 
and all the factors are required to be binary and we use Boolean algebra - much of that hardness comes 
from the possibility of overlapping components. Yet, in many applications we are perfectly happy to 
partition at least one of the modes. In this paper we investigate what consequences does this partitioning 
have on the computational complexity of the Boolean tensor factorizations and present a new algorithm 
for the resulting clustering problem. This algorithm can alternatively be seen as a particularly regularized 
clustering algorithm that can handle extremely high-dimensional observations. We analyse our algorithms 
with the goal of maximizing the similarity and argue that this is more meaningful than minimizing the 
dissimilarity. As a by-product we obtain a PTAS and an efficient 0.828-approximation algorithm for 
rank-1 binary factorizations. Our algorithm for Boolean tensor clustering achieves high scalability, high 
similarity, and good generalization to unseen data with both synthetic and real-world data sets. 


1 Introduction 


Tensors, multi-way generalizations of matrices, are becoming more common in data mining. Binary 3-way 
tensors, for example, can be used to record ternary relations (e.g. source IP—target IP—target port in 
network traffic analysis), a set of different relations between the same entities (e.g. subject—relation—object 
data for information extraction), or different instances of the same relation between the same entities (e.g. 
the adjacency matrices of a dynamic graph over time, such as who sent email to whom in which month). 
Given such data, a typical goal in data mining is to find some structure and regularities from it. To that end, 
tensor decomposition methods are often employed, and if the data is binary, it is natural to restrict also the 
factors to be binary. This gives rise to the various Boolean tensor decompositions that have recently been 
studied (e.g. Miettinen, 2011 Erdos and Miettinen 2013b Cerf et al 2013). 

Finding good Boolean tensor decompositions is, however, computationally hard. Broadly speaking, this 
hardness stems from the complexity caused by the overlapping factors. But when the existing Boolean 
tensor decomposition algorithms are applied to real-world data sets, they often return factors that are 
non-overlapping in at least one mode. Typically the mode where these non-overlapping factors appear is 
unsurprising given the data: in the above examples, it would be target ports in network data, relations in 
information extraction data, and the time (month) in the email data - in all of these cases, this mode has 
somewhat different behavior compared to the other two. 

These observations lead to the question: what if we restrict one mode to non-overlapping factors? This 
removes one of the main sources of the computational complexity - the overlapping factors - from one mode, 
so we would expect the problem to admit better approximability results. We would also expect the resulting 
algorithms to be simpler. Would the computationally simpler problem transfer to better results? In this 
paper we study all these questions and, perhaps surprisingly, give an affirmative yes answer to all of them. 
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Normally, the quality of a clustering or tensor decomposition is measured using the distance of the original 
data to its representation. In Section 3.3 we argue, however, that for many data mining problems (especially 
with binary data), measuring the quality using the similarity instead of distance leads to a more meaningful 
analysis. Motivated by this, the goal of our algorithms is to maximize the similarity (Section |4.1[ ) instead of 
minimizing the dissimilarity. As a by-product of the analysis of our algorithms, we also develop and analyse 
algorithms for maximum-similarity binary rank-1 decompositions in Section [4.2| We show that the problem 
admits a polynomial-time approximation scheme (PTAS) and present a scalable algorithm that achieves a 
2(y/2 — 1) ss 0.828 approximation ratio. The experimental evaluation, in Section [ 5 J shows that our algorithm 
achieves comparable representations of the input tensor when compared to methods that do (Boolean or 
normal) tensor CP decompositions - even when our algorithm is solving a more restricted problem - while 
being generally faster and admitting good generalization to unseen data. 


2 Preliminaries 

Throughout this paper, we indicate vectors as bold lower-case letters (v), matrices as bold upper-case letters 
(M), and tensors as bold upper-case calligraphic letters (T). Element ( i,j , k) of a 3-way tensor X is denoted 
as Xijk- A colon in a subscript denotes taking that mode entirely; for example, X ::k is the kth frontal slice of 
AT (X k in short). 

A tensor can be unfolded into a matrix by arranging its fibers (i.e. its columns, rows, or tubes in case of a 
3-way tensor) as columns of a matrix. For a mode-n matricization, mode -71 fibers are used as the columns 
and the result is denoted by X( n \. 

The outer product of vectors is denoted by IEI. For vectors a, 6, and c of length n, m, and l, X = aEhEc 
is an 7i-by-m-by-^ tensor with Xij k = atbjC k . 

For a tensor X , \X\ denotes its number of non-zero elements. The Frobenius norm of a 3-way tensor X is 
ll*|| = ^/e”” k'^'ijk' ^ * is binary, X — | X | . The similarity between two 7 x- by -7 11 - by- 1 binary tensors X 
and y is defined as sim(Af, y) = nml — \X — y |. 

The Boolean tensor sum of binary tensors X and y is defined as (X V y)ij k = Xij k V y t j k - For binary 
matrices X and Y where X has r columns and Y has r rows their Boolean matrix product, X o Y , is defined 
as (W o Y)ij = VL , x ik y k j. The Boolean matrix rank of a binary matrix A is the least r such that there 
exists a pair of binary matrices (W, Y) of inner dimension r with A = X o Y. 

A binary matrix X is a cluster assignment matrix if each row of X has exactly one non-zero element. In 
that case the Boolean matrix product corresponds to the regular matrix product, 


XoY = XY . 


(1) 


We can now define the Boolean tensor rank and two decompositions: Boolean CP and Tucker3. 

Definition 1. The Boolean tensor rank of a 3-way binary tensor X, rank^(AT), is the least integer r such 
that there exist r triplets of binary vectors (a;, bi , q) with X = \J r i=1 ca IE E . 

The low-rank Boolean tensor CP decomposition is defined as follows. 


Problem 1 (Boolean CP). Given an n-by-m-by-l binary tensor X and an integer r, find binary matrices A 
( 71 -by-r), B (rn-by-r), and C (l-by-r) such that they minimize \X — \J r i=1 a,i E 6, E c,|. 


The standard (non-Boolean) tensor rank and CP decomposition are defined analogously (Kolda and 


Bader 2009). Both finding the least error Boolean CP decomposition and deciding the Boolean tensor rank 


are NP-hard (Miettinen 2011). Following Kolda and Bader (2009), we use [[A, B,C]\ to denote the normal 
3-way CP decomposition and [[A, B,C]]b for the Boolean CP decomposition. 
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Let X be Jii-by-mi and Y be n^Asy-m?. matrix. Their Kronecker (matrix) product, X ® Y, is the 
niri 2 -by-mim 2 matrix defined by 


(x n Y 

x 12 Y ■ 


X21 Y 

X 22 V ■ 

*^2mi ^ 



^ J 


The Khatri-Rao (matrix) product of X and Y is defined as “column-wise Kronecker”. That is, X and 
Y must have the same number of columns (mi = m 2 = m), and their Khatri-Rao product X 0 Y is the 
niri 2 -by-m matrix defined as 


XqY = (x 1 ®y l ,x 2 ®y 2 ,... 1 x m ®y m ) . 


( 2 ) 


Notice that if X and Y are binary, so are X ® Y and X © Y. 

We can write the Boolean CP as matrices using unfolding and matrix products: 

X (1) = Ao(C0Bf 

X {2) =Bo(CQA) t (3) 

X(3) = C o {B Q A) t . 


The Boolean Tucker3 decomposition can be seen as a generalization of the Boolean CP decomposition: 


Problem 2 (Boolean Tucker3). Given an n-by-m-by-Z binary tensor X and three integers rq, r 2 , and r-s, find 
a binary ri-by-r 2 -by-r 3 core tensor Q and binary factor matrices A (n-by-ri), B (m-by-r 2 ), and C (Z-by- 7 ' 3 ) 
such that they minimize 


ri r 2 r 3 

A - \f V V <J °Ai c 7 

OL —1 (3 = 1 7=1 


(4) 


We use [[£; A, B, C]\b as the shorthand notation for the Boolean Tucker3 decomposition. 


3 Problem Definitions 


We consider the variation of tensor clustering where the idea is to cluster one mode of a tensor and potentially 
reduce the dimensionality of the other modes. Another common approach is to do the co-clustering equivalent, 
that is, to cluster each mode of the tensor simultaneously. The former is the approach taken, for example, 
by Huang et al (2008), while the latter appears for instance in Jegelka et al (2009). Unlike either of these 
methods, however, we concentrate on binary data endowed with the Boolean algebra. 


3.1 General Problem 


Assuming a 3-way tensors and that we do the clustering in the last mode, we can express the general Boolean 
tensor clustering (BTC) problem as follows: 


Problem 3 (BTC). Given a binary n-by-m-by-Z tensor X and integers k-\, k 2 , and £ 3 , find a ki-by-k 2 -by-k 3 
binary tensor Q and matrices A £ {0, l} nxfc i, B £ {0, l} mxfe2 , and C £ {0, l } ,xfc 3 such that C is a cluster 
assignment matrix and that [[£/; A, B, C(\b is a good Boolean Tucker decomposition of X. 


If we let k\ = n and /C 2 = m, we obtain essentially a traditional (binary) clustering problem: Given a 
set of l binary matrices {Xj, X 2 ,..., X{\ (the frontal slices of X), cluster these matrices into £3 clusters, 
each represented by an n-by-m binary matrix G t (the frontal slices of the core tensor Q): the factor matrices 
A and B can be left as identity matrices. We call this problem Unconstrained BTC. Writing each of the 
involved matrices as an nm-dimensional vector, we obtain the following problem, called the Discrete Basis 
Partitioning problem (DBPP) in (Miettinen et al 20081: 
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Problem 4 (Miettinen et al DBPP 
find matrices C £ 
minimize II X — CG 


2008). 

{0,l} ,xfc and G £ {0, l} fc 


Given an l-by-nm binary matrix X and a positive integer k, 
xnm suc h that C is a cluster assignment matrix and C and G 


Miettinen et al (2008) show that DBPP is NP-hard and give a (10 + ^-approximation algorithm that 

l, and k, while Jiang (2014]) gives a 2-approximation algorithm that 


runs in polynomial time w.r.t. n, m 
runs in time 0(nml k ). 


3.2 Boolean CP Clustering 

Constraining k\, k- 2 , and Q yields to somewhat different looking problems, though. In what can be seen as 
the other extreme, we can restrict k\ = &2 = k 3 = k and Q (which now is k-by-k-by-k) to hyperdiagonal to 
obtain the Boolean CP clustering (BCPC) problem: 

Problem 5 (BCPC). Given a binary n-by-m-by-l tensor X and an integer k, find matrices A £ {0, l} raxfe , 
B £ {0, l} mxfe , and C £ {0, l}* xfe such that C is a cluster assignment matrix and that [[A, B , C]\b is a 
good Boolean CP decomposition of X. 

What BCPC does is perhaps easiest to understand if we use the unfolding rules ([3]) and write 

X {3) ^Co(BQA) T , (5) 

where we can see that compared to the general BTC, we have restricted the type of centroids: each centroid 
must be a row of type (fo© a) T . This restriction plays a crucial role in the decomposition, as we shall see 
shortly. Notice also that using ([T]) we can rewrite ([5]) without the Boolean product: 

X (3) « C(£? © A) t . (6) 


3.3 Similarity vs. Dissimilarity 

So far we have avoided defining what we mean by “good” clustering. Arguably the most common approach 
is to say that any clustering is good if it minimizes the sum of distances (i.e. dissimilarities) between the 
original elements and their cluster’s representative (or centroid). An alternative is to maximize the similarity 
between the elements and their representatives: if the data and the centroids are binary, this problem is 


known as Hypercube segmentation (Ivleinberg et al 2004). While maximizing similarity is obviously a dual of 


minimizing the dissimilarity - in the sense that an optimal solution to one is also an optimal solution to the 
other - the two behave differently when we aim at approximating the optimal solution. For example, for a 
fixed k. the best known approximation algorithm to DBPP is the aforementioned 2-approximation algorithm 


(Jiang 2014), while Alon and Sudakov (1999) gave a PTAS for the Hypercube segmentation problem. 


The differences between the approximability, and in particular the reasons behind those differences, are 
important for data miners. Namely, it is not uncommon that our ability to minimize the dissimilarities is 
bounded by a family of problems where the optimal solution obtains extremely small dissimilarities, while 
the best polynomial-time algorithm has to do with slightly larger errors. These larger errors, however, might 
still be small enough relative to the data size that we can ignore them. This is what essentially happens 
when we maximize the similarities. 

For example, consider an instance of DBPP where the optimal solution causes ten errors. The best-known 
algorithm can guarantee to not cause more than twenty errors; assuming the data size is, say, 1000-by-500, 
making mistakes on twenty elements can still be considered excellent. On the other hand, should the optimal 
solution make an error on every fourth element, the best approximation could only guarantee to make errors 
in at most every second element. When considering the similarities instead of dissimilarities, these two 
setups look very different: In the first, we get within 0.99998 of the best solution, while in the other, we are 
only within 0.66667 of the optimal. Conversely, if the algorithm wants to be within, say, 0.9 of the optimal 
similarity in the latter setup, it must approximate the dissimilarity within a factor of 1.3. Hence similarity is 
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Algorithm 1 SaBoTeur algorithm for the BCPC 

Input: 3-way binary tensor X. number of clusters k, number of samples r. 
Output: Binary factor matrices A and B, cluster assignment matrix C. 

1: function SaBoTeur(A, k, r) 

2: repeat 

3: Sample k rows of X( 3 ) into matrix Y 

4: Find binary matrices A and B that maximize sim (Y,(BQA) t ) 

5: Cluster C by assigning each row of X( 3 ) to its closest row of (B 0 A) T 

6: until r resamples are done 

7: return best A, B, and C 


less sensitive to small errors (relative to the data size) and more sensitive to large errors, than dissimilarity is. 
In many applications, this is exactly what we want. 

None of this is to say that we should stop considering the error and concentrate only on the similarities. 
On the contrary: knowing the data size and the error, we can easily compute the similarity for the binary 
data, and especially when the error is relatively small, it is much easier to compare than the similarity scores. 
What we do argue, however, is that when analyzing the approximation ratio a data mining algorithm can 
obtain, similarity can give us a more interesting picture of the actual behavior of the algorithm. Hence, for 
the rest of the paper, we shall concentrate on the maximum-similarity variants of the problems; most notably, 
we concentrate on the maximum-similarity BCPC, denoted BCPC max . 


4 Solving the BCPC r 


Given a tensor X, for the optimal solution to BCPC max , we need matrices A, B , and C that maximize 
sim(X (3) , C(B 0 A)^ 1 ). If we replace B 0 A with an arbitrary binary matrix, this would be equal to the 
Hypercube segmentation problem defined by Kleinberg et al ( |2004 ): Given a set S of l vertices of the 
d-dimensional cube {0, l} d , find k vertices P\,, Pk € {0,1}“and a partition of S into k segments to 
maximize X^ceS s ' m (-f 5 o c). Hence our algorithms resembles those for Hypercube segmentation, with 

the added restrictions to the centroid vectors. 


4.1 The Algorithm 

Alon and Sudakov (1999|) gave an algorithm for the Hypercube segmentation problem that obtains a similarity 
within (1 — e) of the optimum. The running time of the algorithm is e°^ fc / e ^ lnk ^nml for n-by-m-by-l data. 
While technically linear in data size, the first term turns the running time unfeasible even for moderate 
values of k (the number of clusters) and e. We therefore base our algorithm on the simpler algorithm by 
Kleinberg et al ( |2004 ) that is based on random sampling. This algorithm obtains an approximation ratio of 
0.828 — £ with constant probability and running time 0(nmlk( 9/e) fc ln(l/e)). While the running time is still 
exponential in fc, it is dominated by the number of samples we do: each sample takes time 0(nmlk) for k 
clusters and n-by-m-by-Z data. For practical purposes, we can keep the number of samples constant (with 
the cost of losing approximation guarantees, though). 

Our algorithm SaBoTeur (SAmpling for Boolean TEnsor clusteRing), Algorithm [l] considers only the 
unfolded tensor X( 3 y In each iteration, it samples k rows of X( 3 ) as the initial, unrestricted centroids. It 
then turns these unrestricted centroids into the restricted type in line [4| and then assigns each row of _X"( 3 ) to 
its closest restricted centroid. The sampling is repeated multiple times, and in the end, the factors that gave 
the highest similarity are returned. 

The algorithm is extremely simple and fast. In line [3] the algorithm samples k rows of the data as its 
initial centroids. Kleinberg et al (2004) proved that among the rows of X( 3 ) that are in the same optimal 
cluster, one is a good approximation of the (unrestricted) centroid of the cluster: 
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Lemma 1 (Kleinberg et al|[2004 Lemma 3.1). Let X be an n-by-m binary matrix and let 


y* = argmax V'sim (xi,y) . 
y6{o,i}™ i=1 


Then there exist a row Xj of X such that 

n n 

Y sim(x i , Xj) > (2a/2 - 2) y sim(a 0 y*) . (7) 

z — 1 i=l 

That is, if we sample one row from each cluster, we have a high probability of inducing a close-optimal 
clustering. 


4.2 Binary Rank-1 Matrix Decompositions 

The next part of the SaBoTeur algorithm is to turn the unrestricted centroids into the restricted form 
(line [4]). We start by showing that this problem is equivalent to finding the maximum-similarity binary rank -1 
decomposition of a binary matrix: 

Problem 6 (Binary rank-1 decomposition). Given an n-by-m binary matrix X, find an n-dimensional binary 
vector a and an m-dimensional binary vector b that maximize sim(_X), a Mb). 


Lemma 2. Given an k-by-nm binary matrix X, finding n-by-k and m-by-k binary matrices A, B that 
maximize sim(4f, (B 0 A) T ) is equivalent to finding the most similar binary rank-1 approximation of each 
row x of X, where the rows are re-shaped as n-by-m binary matrices. 


Proof. If Xi is the ith row of X and z. L is the corresponding row of (B © A ) T , then sim(W, (B 0 A) T ) = 
Yi =1 sim(a 0 Zi ), and hence we can solve the problem row-by-row. Let x = (a 0 i, £ 2 , 1 , ■ ■ ■, 20 , 1 , £ 1 , 2 , ■ • •, 20 m ) 
be a row of X. Re-write x as an n-by-m matrix Y , 


201 

202 • 

’ ^1 

201 

202 • 

’ %2,m 

2-n,l 

2-n, 2 

%n,m J 


Consider the row of (B © A) T that corresponds to x , and notice that it can be written as (b 0 a ) T , 
where a and b are the columns of A and B that correspond to s. As (b 0 a) T = (61 a T , 62 a 7 , • • • ,b m a T ), 
re-writing it similarly as x we obtain 


( aibi 

aib 2 ■ 

• &i6 m \ 

a 2 bi 

a 2 b 2 ■ 

• <^2^771 

\a n bi 

a n b 2 ■ 

dn^m J 


ab T = aM b . 


Therefore, sim(cc, (b 0 a) T ) = sim(Y\ aMb). 


□ 


We start by showing that the maximum-similarity binary rank-1 decomposition admits a PTAS. To that 
end, we present a family of randomized algorithms that on expectation attain a similarity within (1 — e) 
of the optimum. The family is similar to that of Alon and Sudakov s (1999) and can be analysed and 


de-randomized following the techniques presented by them. The family of algorithms (one for each e) is 
presented as Algorithm [2] 

Lines [7] and | 8 ] require us to solve one of the factor vectors when the other is given. To build b (line [7]), we 
simply take the column-wise majority element of those rows where a is 1 , and we extend a similarly in line [ 8 ] 
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Algorithm 2 Randomized PTAS for maximum-similarity binary rank-1 decompositions 
Input: An n-by-m binary matrix X. 

Output: Binary vectors a and b. 

1: function PTAS e (X) 

2 : Sample l = 0(e -2 ) rows of X u.a.r. with replacements 

3: for all partitions of the sample into two sets do 

4: Let X' contain the rows in the sample 

5: for both sets in the partition do 

6 : Let a be the incidence vector of the set 

7: Find b T maximizing sirr^X', ab T ) 

8 : Extend a to the rows not in the sample maximizing sim(X, ab T ) 

9: return best vectors a and b 


Algorithm 3 Approximation of maximum-similarity binary rank-1 decompositions 
Input: An n-by-m binary matrix X. 

Output: Binary vectors a and b. 

1: function A(X) 

2: for all rows x t of X do 

3: Let b = Xi 

4: Find a maximizing sim(X, ab T ) 

5: return best vectors a and b 


The analysis of Algorithm [2] follows closely the proofs by Alon and Sudakov (1999) and is omitted. 

The running time of the PTAS algorithm becomes prohibitive even with moderate values of e, and therefore 
we will not use it in SaBoTeur. Instead, we present a simple, deterministic algorithm that approximates the 
maximum similarity within 0.828, Algorithm [3] It is similar to the algorithm for Hypercube segmentation 
based on random sampling presented by Kleinberg et al (2004). The algorithm considers every row of X as 
a potential vector b and finds the best a given b. Using Lemma [l] it is straight forward to show that the 
algorithm achieves the claimed approximation ratio: 


Lemma 3. Algorithm^ approximates the optimum similarity within 0.828 in time 0(nm min{n, m}). 

Proof. To prove the approximation ratio, let a*(b*) T be the optimum decomposition. Consider the rows 
in which a* has 1. Per Lemma 111 selecting one of these rows, call it 6, gives us sim(X,a* b T ) > (2^2 - 
2)sim(X, a*b T ) (notice that a*hr agrees with the optimal solution in rows where a* is zero). Selecting a 
that maximizes the similarity given b can only improve the result, and the claim follows as we try every row 
of X. 

If n < to, the time complexity follows as for every candidate b we have to make one sweep over the matrix. 
If to < n, we can operate on the transpose. □ 


4.3 Iterative Updates 

The SaBoTeur algorithm bears resemblance to the initialization phase of fc-means style clustering algorithms. 
We propose a variation of SaBoTeur, SaBoTeur+itUp, where fc-means style iterative updates are performed on 
the result of SaBoTeur until it does not improve further. In each iteration of the iterative updates phase, the 
new centroids are first determined by placing Is in those positions where the majority of cluster members have 
a 1. Next, these new centroids are constrained to rank-1, and finally, the cluster assignment is recomputed. 

This procedure clearly slows down the algorithm. Also, due to the constraint on the centroids the iterative 
updates might actually impair the result, unlike in the normal fc-means algorithm that converges to a local 
optimum. However, as we shall see in Section [5j the results in practice improve slightly. 
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4.4 Implementation Details 

We implemented the SaBoTeur algorithm in cQ In the implementation, the tensor X is represented as 
a bitmap. This allows for using fast vectorized instructions such as xor and popcnt. In addition this 
representation takes exactly one bit per entry (excluding necessary padding), being very memory efficient 
even compared to sparse tensor representations. 

The second ingredient to the fast algorithm we present is parallelization. As every row of Xj 3 ) is handled 
independently, the algorithm is embarrassingly parallel. We use the OpenMP (Dagum and Menon, 1998[ ) and 
parallelize along the sampled cluster centroids. This means that the rank-1 decompositions of each centroid 
as well as the computation of the similarity in each of the columns of the Khatri-Rao product (line [4] of 
Algorithm [I]) are computed in parallel. 


4.5 Selecting the Number of Clusters 

A common problem in data analysis is that many methods require the selection of the model order a priori; for 
example, most low-rank matrix factorization methods require the user to provide the target rank before the 
algorithm starts. Many clustering methods - ours included - assume the number of clusters as a parameter, 
although there are other methods that do not have this requirement. 

When the user does not know what would be a proper number of clusters for the data, we propose the 
use of the minimum description length principle (MDL, Rissanen 1978) for automatically inferring it. In 


particular, we adopt the recent work of Miettinen and Vreeken (2014) on using MDL for Boolean matrix 
factorization (BMF) to our setting (we refer the reader to Miettinen and Vreeken| (2014) for more information). 

The intuition behind the MDL principle is that the best model for the data (best number of clusters, in 
our case) is the one that lets us compress the data most. We use the two-part (i.e. crude) MDL where the 
description length contains two parts: the description length of the model M, L(M ), and that of the data D 
given the model M , L(D \ M). The overall description length is simply L(M) + L(D \ M). The optimal 
model (in MDL’s sense) is the one where L(M) + L(D \ M) is minimized. 

In BCPC, the model consists of the two factor matrices A and B and the cluster assignment matrix C. 
Given these, we can reconstruct the original tensor X if we know the locations where [[A, B , C]]b differs from 
X. Therefore, L(M) is the total length of encoding A, B , and C, while L(D \ M ) is the length of encoding 
the locations of the differences, i.e. tensor X © [[A, B, C]]b, where © is the element-wise exclusive-or. 

Encoding A and B is similar to BMF, so we can use the DtM encoding of |Miettinen and Vreeken (2014) 
for encoding them (and the related dimensions of the data). To encode the matrix C, we only need to store 
to which cluster each frontal slice is associated to, taking l ■ log 2 (fc) bits. This finalises the computation of 
L(M). 

To compute L(D | M), we note that we can re-write X © [[A, B , C]]b = ^( 3 ) © C o (B o A) T . The 
computation of L(D \ M ) now becomes equivalent of computing it for Boolean matrix factorization with 
factor matrices C and D = (B o A) T . Hence, we can follow the Typed XOR DtM approach of Miettinen 
and Vreeken (2014) directly. 


To sum up, in order to use MDL to select the number of clusters, we have to run SaBoTeur with different 
numbers of clusters, compute the description length for each result, and take the one that yields the smallest 
description length. 


4.6 Maximum-Similarity BTC 

We note here that as the Hypercube segmentation problem is the maximization version of the DBPP 
(Problem [4j , the algorithms of Kleinberg et al (2004) and Alon and Sudakov (19991 can be used as such for 
solving the maximum-similarity Unconstrained BTC. 


lr The code is available from http://www.mpi-inf.mpg.de/~pmiettin/btc/ 









































5 Experimental Evaluation 


5.1 Other Methods and Evaluation Criteria 

To the best of our knowledge, this paper presents the first algorithms for Boolean tensor clustering and 
hence we cannot compare directly to other methods. We decided to compare SaBoTeur to continuous and 
Boolean tensor CP decompositions. We did not use other tensor clustering methods as they aim at optimizing 
significantly different targets (see Se ction [6] for more elaborate discussion). 

We used the BCP_ALS ( |Miettinen| 20111 and Walk ’n’Merge (Erdos and Miettinen 2013b) algorithms for 
computing Boolean CP decompositions. BCP_ALS is based on iteratively updating the factor matrices one at 
a time (similarly to the classical alternating least squares optimizations), while Walk’n^ Merge is a recent 
algorithm for scalable Boolean tensor factorization in sparse binary tensors. We did not use Walk’n’ Merge on 


synthetic data as BCP_ALS is expected to perform better on smaller and denser tensors (Erdos and Miettinen 


2013bI, but we used it on some larger real-world tensors; BCP.ALS, on the other hand, does not scale well to 


larger tensors and hence we had to omit it from most real-world experiments. 

_ Of the continuous methods we used ParCube ( Papalexakis et al 2012^ and CP_APR (Chi and Kolda 

2012) (implementation from the Matlab Tensor Toolbox v2.(Q. CP_APR is an alternating Poisson regression 


algorithm that is specifically developed for sparse (counting) data (which can be expected to follow the 
Poisson distribution), with the goal of returning sparse factors. 

The other method, ParCube, uses sampling to find smaller sub-tensors. It then solves the CP decomposition 
in this sub-tensor, and merges the solutions back into one. We used a non-negative variant of ParCube that 
expects non-negative data, and returns non-negative factor matrices. 

For synthetic data, we report the relative similarity, that is, the fraction of the elements where the data 
and the clustering agree. For real-world data, we report the error measured using squared Frobenius norm 
(i.e. the number of disagreements between the data and the clustering when both are binary). Comparing 
binary methods against continuous ones causes issues, however. Using the squared Frobenius can help the 
real-valued methods, as it scales all errors less than 1 down, but at the same time, small errors cumulate 
unlike with fully binary data. To alleviate this problem, we also rounded the reconstructed tensors from 
CP_APR and ParCube to binary tensors. We tried different rounding thresholds between 0 and 1 and selected 
the one that gave the lowest (Boolean) reconstruction error. The rounded versions are denoted by CP_APR 0 / 1 
and ParCube 0/ q. 

It is worth emphasizing that all of the methods we are comparing against are solving a relaxed version of 
the BCPC problem. Compared to BCPC, the Boolean CP factorization is not restricted to clustering the 
third mode while the normal CP factorization lifts even the requirement for binary factor matrices. Hence, 
a valid solution to BCPC is always also a valid solution to (Boolean and normal) CP factorization (notice, 
however, that the solutions for Boolean and normal CP factorization are not interchangeable). The methods 
we compare against should therefore always perform better than (or at least as good as) SaBoTeur. 


5.2 Synthetic Experiments 

To test the SaBoTeur algorithm in a controlled environment, we created synthetic data sets that measured the 
algorithm’s response to (1) different numbers of clusters, (2) different density of data, (3) different levels of 
additive noise, and (4) different levels of destructive noise. All tensors were of size 700-by-500-by-50. All data 
sets were created by first creating ground-truth binary factor matrices A, B, and C. The default number of 
clusters was 7 and the default density of the tensor X = [[A, B , C]]b was 0.05. 

Additive and destructive noise were applied to the tensor X . Additive noise turns zeros into ones while 
destructive noise turns ones into zeros. The default noise leveQfor both types was 10%, yielding to a noised 
input tensor X . 

^http://www.cs.emu.edu/~epapalex/ 

"http://www.sandia.gov/~tgkolda/TensorToolbox/ 

4 'l'he noise levels are reported w.r.t. number of non-zeros. 
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(d) Destructive noise 



(b) Additive noise, compared to the 
original data 
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Destructive noise level 
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Figure 1: Results from synthetic experiments when varying different data characteristics. All j/-axes show the 
relative similarity. All markers are mean values over 5 iterations and the width of the error bars is twice the 
standard deviation. 


We varied each of the four features one at a time keeping the others in their default values, and created 5 
random copies on each parameter combination. The results we report are mean values over these five random 
copies. In all experiments, the number of clusters (or factors) was set to the true number of clusters used to 
create the data. The number of re-samples in SaBoTeur was set to r = 20 in all experiments. 

For similarity experiments, we compared the obtained reconstructions [[A, B , C]]s to both the original 
tensor X and the noised tensor X. 


5.2.1 Varying the Noise 


In the first experiment, we studied the effects different noise levels have to the reconstruction accuracy. 
First, we varied the level of additive noise from 5% to 50% (in steps of 5%). The results are in Figure [T(a)j 
where we can see that SaBoTeur (with and without iterative updates) and CP_APRo/i all achieve very high 
similarity with a gentle downwards slope as the level of additive noise increases. BCP.ALS is slightly worse, but 
consistent, while the performance of ParCubeg^ suffers significantly from increased noise levels. In order to 
test if the good performance of SaBoTeur means it is just modeling the noise, we also compared the similarity 
of the reconstruction to the original, noise-free tensor X (Figure fl (b)| ). This did not change the results in 
meaningful ways, except that BCP_ALS improved somewhat. 

When moving from additive to destructive noise (Figure|l(d)|, SaBoTeur, SaBoTeur+itUp, and CP_APR 0 /i 
still stay as the best three methods, but as the destructive noise level increases, BCP_ALS approaches the three. 
That behaviour, however, is mostly driven by ‘modeling the noise’, as can be seen in Figure l(e)| which shows 
the similarity to the noise-free data. There, with highest levels of destructive noise, SaBoTeur’s performance 
suffers more than CP_APR 0 /i’s. 
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(a) Speed vs. number of clusters 
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mode 
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(c) Speed vs. number of threads 


Figure 2: Results from scalability tests. All times are wall-clock times. Markers are mean values over 5 
iterations and the width of the error bars is twice the standard deviation. 


5.2.2 Varying the Number of Clusters 

The number of clusters varied from 3 to 15 with steps of 2. The results are shown in Figure l(c)| None of the 
tested methods show any significant effect to the number of clusters, and the best three methods are still 
SaBoTeur, SaBoTeur+itUp, and CP_APR 0 /i. 


5.2.3 Varying the Density 

The density of the tensor varied from 5% to 30% with steps of 5%. The results can be seen in Figure [1(f)] 
All methods perform worse with denser data, with CP_APRo/i, SaBoTeur+itUp, and SaBoTeur being the best 
three methods in that order. BCP_ALS and ParCube’s results quickly went below 75% similarity, ParCube 0 /i 
going as low as 55% with 30% density (we omit the worst results from the plots for clarity). Note that an 
increased density implies higher amounts of noise as the level of noise is defined with respect to the number 
of ones. 


5.2.4 Scalability 

We run the scalability tests on a dedicated machine with two Intel Xeon X5650 6-Core CPUs at 2.66GHz 
and 64GB of main memory. All reported times are wall-clock times. For these experiments, we created a 
new set of tensors with a default size of 800-by-800-by-500, a density of 5%, 10% of additive and destructive 
noise, and 20 clusters by default. During the experiments, we varied the size, the number of clusters, and the 
number of threads used for the computation. 

We also tried ParCube, CP_APR, and BCP_ALS with these data sets, but only ParCube was able to finish 
any of the data sets. However, already with 200-by-200-by-500 data, the smallest we tried, it took 155.4 
seconds. With 10 clusters, the least number we used, ParCube finished only after 1319.4 seconds. Therefore, 
we omitted these results from the figures. 

As can be seen in Figure |2(a)| the SaBoTeur algorithm scales very well with the number of clusters. 
This is mostly due to efficient parallelization of the computing of the clusters (c.f. below). SaBoTeur+itUp, 
however, slows down with higher number of clusters. This is to be expected, given that more clusters require 
more update computations. 

For the second experiment, we varied the dimensionality of the first and second mode between 200 and 
3400 with steps of 400. The results can be seen in Figure [2(b)| As we grew both modes simultaneously, the 
size of the tensor grows as a square (and, as the density was kept constant, so does the number of non-zeros). 
Given this, the running time of SaBoTeur grows as expected, or even slower, while SaBoTeur+itUp is again 
clearly slower. 

In the last scalability experiment (Figure [2(c)|>, we tested how well SaBoTeur parallelizes. As the computer 
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(a) Test similarity w.r.t. noised (solid (b) Cohen’s k (c) Mutual information 

lines) and original data (dashed lines) 

Figure 3: Evaluation of the generalization quality using (a) reconstruction error on test data, and (b) Kappa 
statistic and (c) mutual information of cluster labels between the obtained and true labels in the test data. 
Markers are mean values over 5 different data sets and the width of the error bars is twice the standard 
deviation. 


used had 12 cores, we set that as the maximum number of threads. Yet, after 9 threads, there were no more 
obvious benefits. We expect that the memory bus is the limiting factor here. 


5.2.5 Generalization Tests 

In our final test with synthetic data, we tested how well SaBoTeur+itUp’s clusterings generalize to yet-unseen 
data. Our hypothesis is that restricting the centroids to rank-1 matrices helps with the overfitting, and 
hence we compared SaBoTeur+itUp to unrestricted Boolean tensor clustering (BTC), where the centroids are 
arbitrary binary matrices. Notice that Unrestricted BTC will always obtain at least as good reconstruction 
error on the training data as SaBoTeur+itUp. 

We generated tensors of size 700-by-500-by-300 with 7 clusters, 15% of density and different levels of both 
additive and destructive noise, from 10% till 50%. We randomly selected 25% of the frontal slices as the test 
data and computed the clusterings on the remaining data. We then connected each frontal slice in the test 
set to its closest centroid. 

The quality of the results was measured in two different ways. First, we used the reconstruction error 
in the test data (Figure [3 (a) | ) and computed the overall similarity of expressing the testing data using the 
centroids. We also used the original (i.e. noise-free) frontal slices to assess whether we ‘model the noise’. The 
results show that with the noise-free test data (dashed lines), SaBoTeur+itUp is better than the unrestricted 
BTC with 30-40% noise, and the results are reversed with 10% and 50% of noise. With noisy test data, the 
differences are less. 

We also measured the similarity of the cluster labels w.r.t. the ground truth. We used two metrics for 
that, Cohen’s re statistic (Figure [3(b)[ ) and normalized mutual information (Figure |3(c)[ ). Cohen’s re takes 
values from [—1,1], —1 meaning complete disagreement and 1 meaning complete agreement. As Cohen’s re 
requires the clusters’ labels to match, we paired the labels using bipartite maximum matching (we used the 
Hungarian method (Papadimitriou and Steiglitz 1998) to compute the matching). The mutual information 
was normalized by dividing it with the joint entropy; the resulting statistic takes values from [0,1], with 1 
meaning perfect agreement. 

The two measures agree that SaBoTeur+itUp is better at 40% of noise and worse with 10% and 50% of 
noise. With 10% of noise, the difference is not significant with either metric, and with 50% of noise, the 
difference is significant only with normalized mutual information. 

Overall, the results show that SaBoTeur+itUp has a small advantage over the unrestricted BTC. Perhaps 
the clearest sign of the benefits of the rank-1 centroids is the consistently smaller standard deviation 
SaBoTeur+itUp obtains. This suggests that SaBoTeur+itUp is less susceptible to random variations in the 
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Table 1: Size and density of the real-world datasets. 


Dataset 

Rows 

Columns 

Tubes 

Density (10 7 ) 

Delicious 

1640 

23584 

7968 

8.23 

Enron 

146 

146 

38 

22752.86 

Facebook 

42390 

39986 

224 

16.51 

Last.FM 

1892 

12523 

9749 

8.07 

MovieLens 

2113 

5908 

9079 

4.23 

MAG 

10197 

10673 

20 

1036.37 

Resolver 

343 

360 

200 

626.01 

TracePort 

10266 

8622 

501 

2.51 

YAGO 

190962 

62428 

25 

67.35 


data, which should improve the generalization quality, 
real-world data (Section 5.3.51 support this. 


Indeed, our generalization experiments with the 


5.3 Real-World Data 

We tested SaBoTeur also with multiple real-world data sets of varying characteristics. The purpose of these 
experiments is to verify our findings with synthetic data. To that end, we studied how well SaBoTeur (and 
the other methods) can reconstruct the data, how the methods scale with real-world data, and how SaBoTeur 
generalizes to unknown data. In addition, we also studied the sparsity of the factors, using MDL to select the 
number of clusters, and how interpretable the results SaBoTeur provides are. 


5.3.1 Data Sets 


We use d nine real-wor ld data sets. The size and density for each data set is listed in Table [lj The Delicious 
datcp] (Canta dor et al, 2011) contains user bookmark-tag tuples from the Delicious social bookmarking 
systeirj^] The E nron dat r^Jcont ains information about who sent an e-mail to whom (rows and columns) per 
months (tubes). The Facebook data setpj ( |Viswanath et al 2009) contains information about who posted 
a message on whose wall per week. The Last.FM datcj^ (Cantador et al 2011) consists of artist-user-tag 
tuples from the Last.F M online music systerr p] The MovieLens data set and the MAG data set are obtained 
from the same source^ (Cantador et al, 2011]). While MovieLens is composed of user-movie-tag tuples, MAG 
consists of movies-actors-genres tuples. The Resolver data contains entity-rel ation-entity tuples from the 
TextRunner open information extraction algorithm^] (|Yates and Etzioni 2009). We used the sample called 
ResolverL in Miettinen] (2011). The TracePort data sel| iJ j contains anonymized passive traffic traces (source 
and destination IP and port numbers) from 2009. The YAGO data set consist of subject-object-relation 
tuples from the semantic knowledge base YAGCp^] ([Suchanek et al 2007). 

As a preprocessing step we removed all the all-zero slices in every mode. In addition, for the Delicious 
data set, also all slices that had less than 6 entries were purged. For the MAG data, all actors that appeared 
in less than 5 movies were discarded. 


http: //grouplens. org/datasets/hetrec-2011 
f http://www.delicious.com 
'http://www.cs.emu.edu/~enron/ 

^http://socialnetworks.mpi-sws.org/datasets.html 
http: / /www. last. f m 

1( http://www.cis.temple.edu/~yates/papers/j air-resolver.html 
11 http://www.caida.org/data/passive/passive_2009_dataset.xml 
http: //www. mpi-inf .mpg. de/yago-naga/yago 
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5.3.2 Reconstruction Accuracy 


As all real-world data sets are very sparse, we report the errors, not the similarity, for these experiments. 
The error is measured as the number of disagreements, when the reconstructed tensor is also binary, or as 
squared tensor Frobenius distance, when the reconstructed tensor is real-valued (not-rounded versions of 
CP_APR and ParCube). The results can be seen in Table[2] The results for CP_APR 0 /i and ParCube 0 /i for all 
data sets except Enron and Resolver are obtained by sampling: If \X\ is the number of non-zeros in data AT, 
we sampled 200 \ X\ locations of Os, and computed the error the methods make for every non-zero and for the 
sampled zero elements. The sampled results were then extrapolated to the size of the full tensor. We had to 
use sampling, as the factor matrices were too dense to allow reconstructing the full tensor. We will discuss 
more on density below. 

The smallest reconstruction error is obtained by ParCubeg/x in almost all experiments, YAGO being a 
notable exception. Remember, however, that ParCube 0 /i returns a non-negative tensor CP decomposition 
that is afterwards rounded to binary tensor, that is, it is neither clustering nor Boolean, and hence together 
with CP_APR 0 /i is expected to be better than SaBoTeur. Also, without the rounding, ParCube is often the 
worst method by a large margin. CP_APR benefits much less from the rounding, CP_APRq/ 1 being comparable 
to SaBoTeur. 

In those data sets where we were able to run BCP_ALS, its results were consistently worse than SaBoTeur’s 
results, despite it solving more relaxed problem. We were unable to get Walk'ndMerge to finish within a 
reasonable time with most data sets: it found rank-1 tensors sufficiently quickly, but took too much time to 


select the top ones from there (see Erdos and Miettinen (2013b) for explanation on how Walk’n’Merge finds 


a Boolean CP decomposition). When it did finish, however, it was slightly better than SaBoTeur+itUp or 
SaBoTeur. 

Finally, the iterative updates of SaBoTeur+itUp consistently improved the results compared to SaBoTeur, 
but did so with significant increase to the running time. It seems that with most data sets, the iterative 
updates are not necessarily worth the extra wait. 


5.3.3 Sparsity of the Factors 

An important question on the practical feasibility of the tensor decomposition and clustering algorithms is 
the density of the factor matrices. Too dense factor matrices increase the computational complexity and 
also the storage requirements. Furthermore, multiplying the dense factors together becomes prohibitively 
expensive, making it impossible to fully re-construct the tensor. This is the reason why, for example, we had 
to use sampling to compute the rounded representations of CP_APR and ParCube. 

We studied the sparsity of the factors using the Delicious data. In Table [3] we report the total densities of 
the matrices A and B , that is, (jA| + \B\)/k{n + to) for n-by-k and m-by-k factors. 

It is obvious that the Boolean methods, SaBoTeur and Walk Merge, are orders of magnitude sparser 
than the continuous methods (CP_APR and ParCube). This was to be expected as similar behaviour has 

2010 ). 


already been observed with Boolean matrix factorizations (Miettinen 


We did not compare the third-mode factor matrices for densities. For SaBoTeur, the clustering assignment 
matrix C has density l/l that depends only on the number of frontal slices l ; obtaining density less than that 
requires having rows of C that have no non-zeros. 


5.3.4 Scalability with Real-World Data 

Table [4] shows the wall-clock times for MovieLens and Resolver with k = 15. MovieLens is one of the sparsest 
data sets, while Resolver is one of the densest ones, and hence these two show us how the relative speed of 
the methods changes as the density changes. The other data sets generally followed the behaviour we report 
here, adjusting to their density. 

With theMovieLens data, ParCube is the fastest of the methods, followed by SaBoTeur. CP_APR, Walk 1 n’ Merge, 
and SaBoTeur+itUp are significantly slower than the two. With the denser Resolver data, however, the order 
is changed, with SaBoTeur and SaBoTeur+itUp being an order of magnitude faster than Walkin’Merge, 


14 










Xi 

P 


O 

P 

CO 

P 

£ 


O 

bjO 


05 


P 


P 

P 

05 


05 

05 

Pi 

P 


P 

05 

P2 

O 

£ 


Pi 

05 

H-J 

X5 

05 

Sh 

P 

& 


X5 

05 


P 

05 


P 

X5 


P 

05 


Pi 

O 


Pi 

O 


05 

P 


P 

o 

05 

05 


V PP 


X5 




o 

05 


CO 

no 

t-H 

CO 




00 

00 


05 

no 

CM 

CO 



l^_ 

no 

^F 


CO 

t-H 

t-H 

05 




CO 

CO 


^F 

no 

CM 

t-H 




CM 

CM 


CM 

CM 

CM 

CM 




CM 

CM 


CM 

CM 

CO 

CM 


< 


CM 

05 

1 

CO 

O 

o 

CO 

1 



CM 

CO 


05 

oo 

05 

CO 



no 

CO 

00 


05 

iO 

CM 

t-H 




CO 



no 

no 

CM 




CM 

CM 


CM 

CM 

b~ 

CM 




CM 

CM 


CM 

CM 

CO 

CM 




oo 

05 


05 

05 

t-H 

CM 

■ 



t- 

05 


LO 

CO 

CO 

CM 



o 

CM 

CO 


CO 

05 

CO 

CO 



co 

no 

iO 


CM 

O 

o 

t-H 




05 

00 


00 

t- 

i> 

CO 




t-H 

t-H 


t-H 

t-H 

b- 

t-H 









CM 





O 

r-H 

1 

05 


t-H 

t-H 

1 



t- 

no 


00 

05 

^F 

CO 



o 

no 

CO 


-f 


05 

^F 



CM 

no 

no 


CO 


CO 

05 


*s 


05 

00 


00 

t- 

no 

t-H 




r-H 

r-H 


i~H 


00 

t-H 


LL 







CM 



+-* 

l/> 


no 

co 

1 

00 

t- 

t-H 

b- 

1 

ra 


CO 

r-H 


CO 

CM 

00 

o 


_l 

no 

t- 

t- 


o 

no 

CO 

CM 



H 

no 

no 


-F 

b- 

05 

05 




05 

00 


oo 

b- 

r- 

CO 




r-H 



t-H 

T—1 

oo 

t-H 









CM 





no 

CM 

1 

CO 

CO 

05 

r- 

1 



O 

b— 


T—i 

CO 

lO 

no 




05 

o 


no 

b- 

CM 

o 



oo 

iO 

CO 


-F 

o 

no 

co 




05 

oo 


oo 

oo 

^F 

co 




t-H 

r-H 


t-H 

t-H 

lb 

t-H 









CM 





CO 

no 

o 

o 

t-H 

O 

o 

| 


no 

no 

CO 

no 

CO 

00 

r- 

05 



r-H 

b- 

b- 

00 

no 

b- 

CM 

CO 




r-H 

T—1 

r-H 

t-H 

t-H 

CM 

t-H 




05 

o 

o 

00 

t-H 

CM 

t-H 

co 


CM 

co 

iO 

no 

05 


no 

no 

iO 

c 

r-H 

b— 

b— 

00 

iO 

00 

CO 

t- 

t- 

o 


r-H 

t-H 

t-H 

t-H 

t-H 

CM 

t-H 

t-H 











c 


05 

CO 

o 

t-H 

b- 

co 


| 

LU 

o 

b- 

no 

no 

CO 

t-H 






b— 

b- 

00 

CO 

00 

CM 

t- 




r-H 

1—1 

r-H 

t-H 

t-H 

CM 

t-H 




r-H 

CO 

o 

oo 

00 

b- 

CM 

1 


no 

r-H 

05 

no 

t-H 

CO 

CO 

O 



00 

b- 

00 

t- 

00 

t-H 

00 




r-H 

t-H 

t-H 

t-H 

t-H 

CM 

t-H 




^F 

05 

| 

CO 

05 

CM 

o 

■ 



o 

iO 


05 

CO 

no 

iO 



o 

o 

00 


t-H 

CO 

CM 

00 



CO 

CO 

CM 


CO 

co 

oo 

t-H 




no 

no 


no 

no 

t-H 





CM 

CM 


CM 

CM 

^F 

CM 




b- 

oo 

1 

CO 

CM 

t-H 

CO 

1 




CO 


CO 

no 

00 

t-H 



iO 

^F 

t- 



CO 

no 

o 


CO 


CO 

CM 


CO 

co 

CO 

CO 


3 


no 

no 


no 

no 

CO 

^cF 


O 


CM 

CM 


CM 

CM 

CO 

CM 


'u 










;_ 


co 

co 


CO 

CO 

o 

CO 


05 


^F 

05 


t-H 

no 

t-H 

CO 

CO 

Q 

o 

no 

CM 


no 

CO 

b- 

iO 

o 


r-H 

CO 

CO 


CO 

CO 

co 

CO 

t-H 



no 

no 


no 

no 

b- 

'vF 

iO 



CM 

CM 


CM 

CM 

co 

CM 

CM 



00 

co 

1 

oo 

co 

t-H 

05 

| 



o 

no 


no 

no 

CM 

CM 



. 

co 



no 

CO 

no 

t-H 



I'¬ 

CO 

CO 


CO 

CO 

b- 

t- 




no 

no 


no 

no 

co 





CM 

CM 


CM 

CM 

CO 

CM 



ll 


Qh 










P 






CD 




-P 






faO 




•H 






Pi 




+ 






CD 



Pi 

P 



rH 


o 

s; 



0 

0 

CO 


o 

CD 

CD 




CD 

CD 

l-J 

Ph 

Ptf 

rO 

rO 

0 



H 

H 

< 

Ph 

Ph 

0 

0 




O 

O 


<u 


O 

O 

M 



CQ 

CQ 

CP 



P 

P 

i—1 



0 

0 

o 

Oh 

Oh 

0 

cd 

cd 



CO 

CO 

CQ 

o 

o 

Qh 

Qh 





b~ 

1 



| 

CO 

O 

| 



CO 

1 



1 

CO 

05 

1 

o 







b- 

CO 




t- 





CO 

CO 



b- 

lO 





b- 

05 


< 


05 






b- 


> 


t-H 





05 

^F 









t-H 










CO 





CO 

1 

| 

CM 

no 


'nF 

1 


o 

b- 

1 


O 


CO 

05 

1 


CM 

b- 



CM 

05 

no 

00 



CO 



CO 

CO 

b- 

o 


o 


CM 



CM 

CM 

05 

CM 


o 

o 


CO 



CO 

CO 

b- 

CO 


CD 


05 

1 

| 

co 

no 

T—1 

05 

| 

ro 


05 

1 


TF 

^F 

05 

CO 

1 

11 

lO 

05 



CO 

05 

00 

05 



t-H 

CO 



CO 

CO 

O 

05 




CM 



CM 

CM 

no 

t-H 




CO 



CO 

CO 

t- 

CO 




CO 

co 

, 

rr- 

t- 

co 

b- 

| 


O 

b- 

b- 


CM 

t-H 

05 

t cF 

1 


co 

b- 

co 


o 

CO 

t- 

CO 




o 

o 


t-H 

O 

05 

05 




t-H 

1 


t-H 

1 

CM 





co 

co 

| 

CM 

CO 

b- 


05 

4_> 


CM 

t-H 


00 

05 

O 

05 

b- 

L. 


05 

05 


o 

00 


o 

co 

o 


O 

O 


t-H 

o 

t—H 

o 

o 

Q_ 


t-H 

t-H 


t-H 

t-H 

CO 

t-H 

t-H 

CD 










U 


"nF 

t-H 


"nF 

05 

CO 

CO 

| 

CT3 

o 

o 

CO 


t—H 

no 

05 

r-H 

1 


t-H 

o 

05 


t-H 

O 

b- 

05 


1 


t-H 

O 


t-H 

t-H 

o 

05 




t-H 

1 


t-H 

1 

CO 





lO 

O 

| 

o 

O 

CO 

00 

1 



00 

05 


"nF 

t-H 

c 

O 

1 


lO 

O 

05 


t-H 

t-H 

00 

b~ 




t-H 

O 


t— H 

t-H 

05 

o 




t-H 



t-H 


t-H 

t-H 



o 

lO 

co 

CM 

t- 

05 

05 

o 

| 


co 

lO 

^F 

CO 

lO 

CO 

CO 

I> 

1 



'nF 

xF 

CO 

'sF 

no 

05 

CO 




t-H 

1 

r-H 

r—1 

1 

t-H 

t-H 




lO 


CO 

05 

00 

00 

o 

"nF 


i 

oo 

lO 

CM 

00 

CO 

05 

no 

co 

CD 


'nF 

^F 

CO 

'nF 

no 

b- 

"nF 

no 

> 


t-H 

t-H 

t-H 

T—1 

t-H 

T—1 

t-H 

t-H 

O 










(/) 

_ 

05 

CO 

^F 

05 

no 

CM 

t-H 

| 

CD 


O 

O' 

CM 

o 


CO 

t- 

1 



lO 

iO 

CO 

no 

no 

b- 

^F 




t-H 

1 

1 

t-H 

1 

t-H 

t-H 




o 

CM 

■'sF 

05 

no 


b- 

1 



CO 

CM 

CM 

t-H 

^F 

00 

o 

1 


iO 

lO 

lO 

CO 

no 

no 

b- 

no 




t-H 



t-H 

1-1 

T—1 

t-H 




t-H 

CO 

■ 

1> 

05 


CO 

1 


o 


t-H 


00 

O' 

CO 

05 

1 


co 

lO 

CO 


CM 

CO 

no 

CO 




CO 

b- 


CO 

CM 

05 

b- 




lO 

^F 


^F 

^F 


co 














oo 

o 

| 

no 

^F 

CO 

no 

1 

{/) 

i3 

b- 

00 


no 

t-H 

CM 

t-H 

1 

c 

CM 

co 

CM 


no 

CO 

oo 

05 


CD 

CO 

b- 


CO 

CO 


b- 


_| 


lO 



"nF 

^F 

no 

co 


CD 










> 

q 


CM 

05 

| 

t- 

no 

no 

oo 

b- 



t-H 

b- 


CM 

no 

CM 

05 

o 


i 

b- 

^F 


05 

CO 

O 

o 

00 



co 

b- 


CO 


CO 

00 

co 



lO 

^F 



"nF 

CO 

CO 

'nF 













r-H 

CO 

| 

'nF 

oo 

05 

co 

1 



CO 

CO 


oo 

t- 

t-H 

CM 

1 


lO 

b- 

05 


no 

t-H 

t- 

CM 




co 

b- 


b- 

b- 

CM 

t-H 




lO 

xF 


"nF 

^F 

no 

"nF 









CM 




II 


Ph 








II 


P> 






0 




P> 






bO 




•H 






P 




+ 






0 



P« 

P 



rH 


O 

S 



0 

0 

CQ 


O 

0 

0 




CD 

0 


Qj 

P^ 



0 



H 

H 

< 

Qh 

Qh 

0 

0 




O 

O 


d 

< 

O 

o 




CQ 

CQ 

Qh 



P 

p 

1—1 



cd 

0 

O 

Qh 

Qh 

0 

0 

0 



CQ 

CQ 

CQ 

O 

O 

Qh 

Qh 



15 



Table 3: Total density of the factor matrices A and B for different k on the Delicious data. 


k = 

7 

10 

15 

30 

SaBoTeur 

0.00016 

0.00031 

0.00029 

0.00029 

CP_APR 

0.20219 

0.14786 

0.10312 

0.05396 

ParCube 

0.28186 

0.21069 

0.16658 

0.09900 

Walk’n’Merge 

— 

0.00022 

— 

— 


Table 4: Average wall-clock running times in seconds for real-world data sets. Averages are over 5 re-starts 
for MovieLens and 10 for Resolver. Both data sets used k = 15 clusters. 


MovieLens 

Resolver 

SaBoTeur 

45.6 

0.02 

SaBoTeur+itUp 

303.6 

0.11 

CP_APR 

196.6 

22.00 

ParCube 

24.8 

7.68 

Walk’n’Merge 

207.5 

2.15 


which still is faster than ParCube or CP_APR. The density of the data does not affect the speed of SaBoTeur, 
while it does have a significant effect to CP_APR and ParCube. SaBoTeur+itUp is less affected by density, and 
more affected by the size of the data, than the other methods. 

5.3.5 Generalization with Real-World Data 

We repeated the generalization test we did with synthetic data with three real-world data sets MovieLens, 
Resolver, and TracePort - keeping 85% of the frontal slices as training data while the remaining 15% we used 
for testing. The results can be seen in Table [5j We used k = 15 clusters for MovieLens and TracePort, and 
10 clusters for the Resolver data. On each data set we conducted 10 repetitions of the training phase, and 
selected the one with the smallest training error. 

As expected, the Unrestricted BTC achieves lower training error, but larger testing error. This agrees with 
our synthetic experiments, and strongly indicates that using the rank-1 centroids helps to avoid overfitting. 

5.3.6 Selecting the Number of Clusters 

We tested the use of MDL to select the number of clusters with two real-world data sets, Enron and MAG. To 
that end, we ran SaBoTeur with every possible number of clusters (i.e. from having all slices in one cluster 
to having one cluster for each slice). Further, we computed the description length of representing the data 
with no clusters, i.e. having the full data explained in the L(D \ M ) part. The description lengths for the 
different numbers of clusters are presented in Figure |4j For Enron, the number of clusters that gave the 
smallest description length was 3, while for MAG it was 16. 

Table 5: Training and testing errors on real-world data. The number of clusters was k = 15 for MovieLens 
and TracePort and k = 10 for Resolver. 




MovieLens 

Resolver 

TracePort 

Train 

SaBoTeur+itUp 

40075 

1282 

9730 


Unrestricted BTC 

39351 

1122 

8843 

Test 

SaBoTeur+itUp 

7258 

225 

1144 


Unrestricted BTC 

8256 

274 

1554 
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(a) Enron (b) MAG 

Figure 4: Description length vs. number of clusters. 


Both Enron and MAG show non-smooth behaviour in the description length. This is likely partially 
due to SaBoTeur’s randomized behaviour and partially due to the fact that SaBoTeur does not consider 


the description length when it builds the clustering. Nevertheless, especially with Enron (Figure 4(a)), the 
selection of the best number of clusters - in the MDL’s sense - seems clear. 


5.3.7 Interpretability of the Results 

For the final experiment with the real-world data, we studied some of the results we obtained from SaBoTeur 
in order to see if they can provide insights to the data. We report results from three data sets: MAG, Delicious, 
and Last.FM. For MAG, we can interpret all three modes (movies, actors, and genres); for the other two, we 
can only interpret two of the three modes (tags and URLs, and tags and artists, respectively), as the third 
mode corresponds to user IDs. 

For, MAG, we used 16 clusters (as suggested by MDL), and clustered the genres. As there are only 20 
genres in the data, this means many clusters contained only one genre; this, however, is not surprising, as we 
would expect the sets of movies that fall in different genres be generally rather different. The results picked 
famous movies of the genre as the centroids: for example, the cluster of animation and comedy genres had 
the 1995 animation Toy Story appearing in its centroid. Another cluster, containing genres adventure and 
fantasy, had as its centroid Peter Jackson’s Lord of the Rings trilogy, and its main cast. Similarly to Toy 
Story , Lord of the Rings movies are arguably stereotypical adventure-fantasy movies. 

The centroids for the MAG data contained very few movies (e.g. the three Lord of the Rings movies). This 
is due to the fact that the data sets are very sparse, and the algorithm can only pick up the most prominent 
elements in the centroids. With even sparse data sets, such as Delicious and Last.FM, the centroids get even 
sparse, perhaps containing just a singleton element from some mode. This is a common problem to many 


Boolean methods (see, e.g. Miettinen et al 2008). To alleviate it, we used a method proposed by Miettinen 


et al (2008), and weighted the reconstruction error. Normally, representing a 0 as a 1 and representing a 1 as 
a 0 both increase the error by 1, but for the next experiments, we set the weight so that representing a 1 
with a 0 causes 10 times more error as representing 0 as a 1. 

For Delicious and Last.FM, we mined 30 clusters. Even with the 10-fold weight, results for Delicious 
were rather sparse (indicating that the users’ behaviour was very heterogeneous). Some clusters were very 
informative, however, such as one with tags software , apple, mac, and osx that contained bookmarked pages 
on HFS for Windows, iVPN, Object-Oriented Programming with Objective-C, and iBooks and ePub, among 
others. For Last.FM, the clusters were generally more dense, as users tend to tag artists more homogeneously. 
For example, the cluster containing the tags pop and mb contained the superstar artists such as Beyonce, 
Britney Spears, Katy Perry, and Miley Cyrus (and many more). But importantly, one could also find 
less-known artists: in a cluster containing tags dance, trance, house, progressive trance, techno, and vocal 
trance, we found artists such as Infected Mushroom, T.M. Revolution, Dance Nation, and RuPaul. 
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5.4 Discussion 


All the experiments show that SaBoTeur can hold its own even against methods that are theoretically 
superior: For reconstruction error (or similarity) with both synthetic (Sections 5.2.1-5.2.31 and real-world 
data (Section 5.3.2), SaBoTeur (and SaBoTeur+itUp) achieve results that are the best or close to the best 
results, notwithstanding that other methods, especially CP_APRo/i and ParCube 0 /i, benefit from significantly 
less constrained forms of decomposition, coupled with the post-hoc rounding. These relaxations also come 


with a price, as both methods create significantly denser factor matrices, and - as can be seen in Sections 5.2.4 


and |5.3.4| - are unable to scale to denser data. Compared to the Boolean CP factorization methods, BCP_ALS 
and Walk ; n’Merge, SaBoTeur is comparable (or sometimes clearly better) in reconstruction error and density 
of the factor matrices, while being significantly more scalable. 

Our hypothesis that the rank-1 centroids help with overfitting was also confirmed (Sections 5.2.5 and 5.3.5): 
While the Unrestricted BTC obtained lower training errors (as expected), SaBoTeur obtained better results 
with testing data in all real-world data sets and in most synthetic experiments (with TracePort, SaBoTeur’s 
testing error was more than 25% better than that of Unrestricted BTC’s). 

Our experiments also show that MDL can be used to select the number of clusters automatically, even 
though we do think that more studies are needed to better understand the behaviour of the description 
length. Also, when studying the results, it seems obvious that SaBoTeur can return results that are easy to 
interpret and insightful; in part, this is due to the Boolean algebra, and in part, due to the clustering. 

The last open question is whether one should use SaBoTeur or SaBoTeur+itUp. Judging by the results, 
SaBoTeur+itUp can provide tangible benefits over SaBoTeur, however, with a significant price in the running 
time. Thence, we think that it is best to start with SaBoTeur, and only if the user thinks she would benefit 
from more accurate results, move to SaBoTeur+itUp. 


6 Related Work 


Tensor clusterings have received some research interest in recent years. The existing work can be divided 


roughly into two separate approaches: on one hand Jegelka et al (2009) study the problem of clustering 


simultaneously all modes of a tensor (tensor co-clustering), and on the other hand, Huang et al (2008) and 
Liu et al ( 2010| ) (among others) study the problem where only one mode is clustered and the remaining 
modes are represented using a low-rank approximation. The latter form is closer to what we study in this 
paper, but the techniques used for the continuous methods do not apply to the binary case. 

Normal tensor factorizations are well-studied, dating back to the late Twenties. The Tucker and CP 


decompositions were proposed in the Sixties (Tucker 1966) and Seventies (Carroll and Chang 1970 Harshman 


1970), respectively. The topic has nevertheless attained growing interest in recent years, both in numerical 


linear algebra and computer science communities. For a comprehensive study of recent work, see |Kolda and| 
Bader (2009), and the recent work on scalable factorizations by Papalexakis et al (2012). 

The first algorithm for Boolean CP factorization was presented by Leenen et al (1999), although without 
much analysis. Working in the framework of formal tri-concepts, Belohlavek et al (2012) studied the 


exact Boolean tensor decompositions, while 

Ignatov et al 

(201 

1) studied the problem of finding dense 

rank-1 subtensors from binary tensors. In data mining 

Miettinen 

(2011 

1, studied computational complexity 

and sparsity of the Boolean tensor decompositions. 

Recently E 

Ados and Miettinen 

(2013b 

) proposed a 


scalable algorithm for Boolean CP and Tucker decompositions, and applied that algorithm for information 


n-ary extensions of closed itemsets (see, 


e.g. 

Cerf et al 

2009 

2013 

jMiettinen 

(2011 

)• 


2013). For more on these methods and their 


Miettinen et al (2008) presented the Discrete Basis Partitioning problem for clustering binary data and 


using binary centroids. They gave a (10 + e) approximation algorithm based on the idea that any algorithm 
that can solve the so-called binary graph clustering (where centroids must be rows of the original data) within 
factor /, can solve the arbitrary binary centroid version within factor 2/. Recently, Jiang (2014) gave a 
2-approximation algorithm for the slightly more general case with k + 1 clusters with one centroid restricted 
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to be an all-zero vector. 

The maximization dual of binary clustering, the hypercube segmentation problem, was studied by lKleinberg| 
et al (1998||2004| and they also gav e three algorithms for the problem, obtaining an approximati on ratio of 


2(72— 1). Later, Seppanen (2005) proved that this ratio is tight for this type of algorithms and Alon and 


Sudakov (1999) presented a PTAS for the problem. 


Recently, Kim and Candan (2011) proposed the tensor-relational model, where relational information 
is represented as tensors. They defined the relational algebra over the (decompositions of) the tensors, 
noting that in the tensor-relational model, the tensor decomposition becomes the costliest operation. Hence, 


their subsequent work (Kim and Candan 2012 2014) has concentrated on faster decompositions in the 
tensor-relational framework. 


7 Conclusions and Future Work 

We have studied the problem of clustering one mode of a 3-way binary tensor while simultaneously reducing 
the dimensionality in the two other modes. This problem bears close resemblance to the Boolean CP tensor 
decomposition, but the additional clustering constraint makes the problem significantly different. The main 
source of computational complexity, the consideration of overlapping factors in the tensor decomposition, 
does not play a role in BCPC. This lets us design algorithms with provable approximation guarantees better 
than what is known for the Boolean matrix and tensor decompositions. 

Our experiments show that the SaBoTeur algorithm obtains comparable reconstruction errors compared 
to the much less restricted tensor factorization methods while obtaining sparse factor matrices and overall 
best scalability. We also showed that restricting the cluster centroids to rank-1 matrices significantly helped 
the generalization to unseen data. 

The idea of restricting the clustering centroids to a specific structure could be applied to other types 
of data, as well. Whether that will help with the curse of dimensionality remains an interesting topic for 
the future research, but based on the results we presented in this paper, we think this approach could have 
potential. On the other hand, one could also consider higher-rank centroids, moving from BCPC towards 
Unrestricted BTC. In a sense, the rank of the centroid can be seen as a regularization parameter, but the 
viability of this approach requires more research, both in Boolean and in continuous domains. 

We also argue that the similarity - as opposed to distance or error - is often more meaningful metric for 
studying the (theoretical) performance of data mining algorithms due to its robustness towards small errors 
and more pronounced effects of large errors. To assess whether similarity is better than dissimilarity, more 
methods, existing and novel, should be analysed, and the results should be contrasted to those obtained when 
studying the dissimilarity and real-world performance. Our hypothesis is that good theoretical performance 
in terms of similarity correlates better with good real-world performance than good (or bad) performance 
correlates with the error. 

Finally, our algorithms could be used to provide fast decompositions of Boolean tensors in the tensor- 
relational model. 
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